EDA Case Study: House Price

Task Description

House Prices is a classic Kaggle competition. The task is to predict the final price of each house. For more detail, refer to https://www.kaggle.com/c/house-prices-advanced-regression-techniques/.

Goal of this notebook

As it is a famous competition, there is plenty of excellent analysis of how to do EDA and how to build models for this task. See https://www.kaggle.com/khandelwallaksya/house-prices-eda for a reference. In this notebook, we will show how dataprep.eda can simplify the EDA process using a few lines of code.

In summary, we will:

  • Understand the problem. We'll look at each variable and do a philosophical analysis of its meaning and importance for this problem.
  • Univariate study. We'll focus on the dependent variable ('SalePrice') and try to learn a little more about it.
  • Multivariate study. We'll try to understand how the dependent and independent variables relate.
  • Basic cleaning. We'll clean the dataset and handle the missing data, outliers and categorical variables.

Import libraries

[1]:
from dataprep.eda import plot
from dataprep.eda import plot_correlation
from dataprep.eda import plot_missing
from dataprep.datasets import load_dataset

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
sns.set(font_scale=1)

Load data

[2]:
houses = load_dataset("house_prices_train")
houses.head()
[2]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

[3]:
houses_test = load_dataset("house_prices_test")
houses_test.head()
[3]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN NaN 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN NaN NaN 0 1 2010 WD Normal

5 rows × 80 columns

[4]:
houses.shape
[4]:
(1460, 81)

There are 1460 tuples in total; each tuple contains 80 features and 1 target value.

[5]:
houses_test.shape
[5]:
(1459, 80)

Variable identification

[6]:
plot(houses)
[6]:
DataPrep.EDA Report

Dataset Statistics

Number of Variables 81
Number of Rows 1460
Missing Cells 348
Missing Cells (%) 0.3%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 4.1 MB
Average Row Size in Memory 2.8 KB
Variable Types
  • Numerical: 38
  • Categorical: 43

Dataset Insights

Id is uniformly distributed Uniform
BsmtFinSF2 and EnclosedPorch have similar distributions Similar Distribution
LowQualFinSF and BsmtHalfBath have similar distributions Similar Distribution
LowQualFinSF and 3SsnPorch have similar distributions Similar Distribution
LowQualFinSF and ScreenPorch have similar distributions Similar Distribution
LowQualFinSF and PoolArea have similar distributions Similar Distribution
LowQualFinSF and MiscVal have similar distributions Similar Distribution
BsmtFullBath and HalfBath have similar distributions Similar Distribution
BsmtHalfBath and 3SsnPorch have similar distributions Similar Distribution
BsmtHalfBath and PoolArea have similar distributions Similar Distribution

BsmtHalfBath and MiscVal have similar distributions Similar Distribution
3SsnPorch and ScreenPorch have similar distributions Similar Distribution
3SsnPorch and PoolArea have similar distributions Similar Distribution
3SsnPorch and MiscVal have similar distributions Similar Distribution
ScreenPorch and MiscVal have similar distributions Similar Distribution
PoolArea and MiscVal have similar distributions Similar Distribution
LotFrontage has 259 (17.74%) missing values Missing
GarageYrBlt has 81 (5.55%) missing values Missing
MSSubClass is skewed Skewed
LotFrontage is skewed Skewed

LotArea is skewed Skewed
OverallQual is skewed Skewed
OverallCond is skewed Skewed
YearBuilt is skewed Skewed
YearRemodAdd is skewed Skewed
MasVnrArea is skewed Skewed
BsmtFinSF1 is skewed Skewed
BsmtFinSF2 is skewed Skewed
TotalBsmtSF is skewed Skewed
2ndFlrSF is skewed Skewed

LowQualFinSF is skewed Skewed
BsmtFullBath is skewed Skewed
BsmtHalfBath is skewed Skewed
FullBath is skewed Skewed
HalfBath is skewed Skewed
BedroomAbvGr is skewed Skewed
KitchenAbvGr is skewed Skewed
TotRmsAbvGrd is skewed Skewed
Fireplaces is skewed Skewed
GarageCars is skewed Skewed

WoodDeckSF is skewed Skewed
OpenPorchSF is skewed Skewed
EnclosedPorch is skewed Skewed
3SsnPorch is skewed Skewed
ScreenPorch is skewed Skewed
PoolArea is skewed Skewed
MiscVal is skewed Skewed
MoSold is skewed Skewed
YrSold is skewed Skewed
LotFrontage has 259 (17.74%) infinite values Infinity

GarageYrBlt has 81 (5.55%) infinite values Infinity
Street has constant length 4 Constant Length
LotShape has constant length 3 Constant Length
LandContour has constant length 3 Constant Length
Utilities has constant length 6 Constant Length
LandSlope has constant length 3 Constant Length
ExterQual has constant length 2 Constant Length
ExterCond has constant length 2 Constant Length
BsmtFinType1 has constant length 3 Constant Length
BsmtFinType2 has constant length 3 Constant Length

HeatingQC has constant length 2 Constant Length
CentralAir has constant length 1 Constant Length
KitchenQual has constant length 2 Constant Length
GarageFinish has constant length 3 Constant Length
PavedDrive has constant length 1 Constant Length
MasVnrArea has 861 (58.97%) zeros Zeros
BsmtFinSF1 has 467 (31.99%) zeros Zeros
BsmtFinSF2 has 1293 (88.56%) zeros Zeros
BsmtUnfSF has 118 (8.08%) zeros Zeros
2ndFlrSF has 829 (56.78%) zeros Zeros

LowQualFinSF has 1434 (98.22%) zeros Zeros
BsmtFullBath has 856 (58.63%) zeros Zeros
BsmtHalfBath has 1378 (94.38%) zeros Zeros
HalfBath has 913 (62.53%) zeros Zeros
Fireplaces has 690 (47.26%) zeros Zeros
GarageCars has 81 (5.55%) zeros Zeros
GarageArea has 81 (5.55%) zeros Zeros
WoodDeckSF has 761 (52.12%) zeros Zeros
OpenPorchSF has 656 (44.93%) zeros Zeros
EnclosedPorch has 1252 (85.75%) zeros Zeros

3SsnPorch has 1436 (98.36%) zeros Zeros
ScreenPorch has 1344 (92.05%) zeros Zeros
PoolArea has 1453 (99.52%) zeros Zeros
MiscVal has 1408 (96.44%) zeros Zeros

Overview of the data

We can get the following information:

  • Variable - the variable name.
  • Type - there are 43 categorical columns and 38 numerical columns.
  • Missing values - how many missing values each column contains. For instance, Fence contains 80.8% × 1460 ≈ 1180 missing tuples. Some models, such as SVM, do not allow missing values in the input data, so we have to clean the data before we can use it.
  • Target value - the distribution of the target value (SalePrice). It is numerical and roughly follows a normal distribution, so we are not confronted with an imbalanced-classes problem, which is great.
  • Guess - judging from the column names, we reckon GrLivArea, YearBuilt and OverallQual are likely to be correlated with the target value (SalePrice).
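The missing-cell percentages in the overview can be reproduced with plain pandas. A minimal sketch on a toy two-column frame (the values here are illustrative, not the real data; with the real data you would pass houses itself):

```python
import pandas as pd

# Toy stand-in for the houses frame (the real one has 81 columns).
df = pd.DataFrame({
    "Fence": [None, "MnPrv", None, None],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Fraction of missing cells per column, sorted descending --
# the same numbers the report's missing-value column is built from.
missing_ratio = df.isnull().mean().sort_values(ascending=False)
```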

Correlation in data

[7]:
plot_correlation(houses, "SalePrice")
[7]:
DataPrep.EDA Report
[8]:
plot_correlation(houses, "SalePrice", value_range=[0.5, 1])
[8]:
DataPrep.EDA Report

OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF, FullBath, TotRmsAbvGrd, YearBuilt, YearRemodAdd have more than 0.5 Pearson correlation with SalePrice.

OverallQual, GrLivArea, GarageCars, YearBuilt, GarageArea, FullBath, TotalBsmtSF, GarageYrBlt, 1stFlrSF, YearRemodAdd, TotRmsAbvGrd and Fireplaces have more than 0.5 Spearman correlation with SalePrice.

OverallQual, GarageCars, GrLivArea and FullBath have more than 0.5 KendallTau correlation with SalePrice.

EnclosedPorch and KitchenAbvGr have a slight negative correlation with the target variable.

These can prove to be important features to predict SalePrice.
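The same threshold check can be done with plain pandas. A sketch on a toy frame (the column values are made up for illustration; with the real data you would pass houses itself):

```python
import pandas as pd

# Toy numeric frame standing in for the real houses data.
df = pd.DataFrame({
    "SalePrice":   [100, 150, 200, 250, 300],
    "OverallQual": [3, 5, 6, 8, 9],
    "MoSold":      [2, 9, 1, 6, 3],
})

# Pearson correlation of every other column with SalePrice,
# filtered with the same 0.5 threshold used above.
corr = df.corr()["SalePrice"].drop("SalePrice")
strong = corr[corr > 0.5]
```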

Heatmap

[9]:
plot_correlation(houses)
[9]:
DataPrep.EDA Report
                             Pearson  Spearman  KendallTau
Highest Positive Correlation   0.882     0.891       0.839
Highest Negative Correlation  -0.495    -0.574      -0.413
Lowest Correlation             0.0       0.0         0.0
Mean Correlation               0.096     0.098       0.078
  • Most positive correlated: (GarageCars, GarageArea)
  • Most negative correlated: (BsmtFinSF1, BsmtUnfSF)
  • Least correlated: (Id, GarageYrBlt)
  • Most positive correlated: (YearBuilt, GarageYrBlt)
  • Most negative correlated: (BsmtFinSF1, BsmtUnfSF)
  • Least correlated: None
  • Most positive correlated: (YearBuilt, GarageYrBlt)
  • Most negative correlated: (BsmtFinSF1, BsmtUnfSF)
  • Least correlated: (BsmtUnfSF, KitchenAbvGr)

In summary

In my opinion, this heatmap is the best way to get a quick overview of features’ relationships.

At first sight, there are two red colored squares that get my attention. The first one refers to the ‘TotalBsmtSF’ and ‘1stFlrSF’ variables, and the second one refers to the ‘GarageX’ variables. Both cases show how significant the correlation is between these variables. Actually, this correlation is so strong that it can indicate a situation of multicollinearity. If we think about these variables, we can conclude that they give almost the same information, so multicollinearity really occurs. Heatmaps are great for detecting these kinds of situations, and in problems dominated by feature selection, like ours, they are an essential tool.

Another thing that got my attention was the ‘SalePrice’ correlations. We can see our well-known ‘GrLivArea’, ‘TotalBsmtSF’, and ‘OverallQual’, but we can also see many other variables that should be taken into account. That’s what we will do next.

[10]:
plot_correlation(houses[["SalePrice","OverallQual","GrLivArea","GarageCars",
                  "GarageArea","GarageYrBlt","TotalBsmtSF","1stFlrSF","FullBath",
                  "TotRmsAbvGrd","YearBuilt","YearRemodAdd"]])
[10]:
DataPrep.EDA Report
                             Pearson  Spearman  KendallTau
Highest Positive Correlation   0.882     0.891       0.839
Highest Negative Correlation   0.0       0.0         0.0
Lowest Correlation             0.096     0.177       0.131
Mean Correlation               0.442     0.461       0.364
  • Most positive correlated: (GarageCars, GarageArea)
  • Most negative correlated: None
  • Least correlated: (TotRmsAbvGrd, YearBuilt)
  • Most positive correlated: (GarageYrBlt, YearBuilt)
  • Most negative correlated: None
  • Least correlated: (TotRmsAbvGrd, YearBuilt)
  • Most positive correlated: (GarageYrBlt, YearBuilt)
  • Most negative correlated: None
  • Least correlated: (TotRmsAbvGrd, YearBuilt)

As we saw above, a few features show high multicollinearity in the heatmap. Let's focus on the red squares along the diagonal and a few on the sides.

  • SalePrice and OverallQual
  • GarageArea and GarageCars
  • TotalBsmtSF and 1stFlrSF
  • GrLivArea and TotRmsAbvGrd
  • YearBuilt and GarageYrBlt

We may want to combine each pair into a single feature before using them as predictors.
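One hedged sketch of combining a collinear pair into a single predictor (the z-score-and-average rule and the GarageSize name are illustrative choices, not part of the original analysis):

```python
import pandas as pd

# Hypothetical treatment of one collinear pair: replace GarageCars and
# GarageArea with a single combined "garage size" feature.
df = pd.DataFrame({
    "GarageCars": [1, 2, 2, 3],
    "GarageArea": [280, 460, 500, 720],
})

# One simple option: z-score each column so the units match, then average.
z = (df - df.mean()) / df.std()
df["GarageSize"] = z.mean(axis=1)
df = df.drop(columns=["GarageCars", "GarageArea"])
```

Other reasonable choices are keeping only one column of the pair, or using PCA as discussed later.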

[11]:
plot_correlation(houses, value_range=[0.5, 1])
[11]:
DataPrep.EDA Report
                             Pearson  Spearman  KendallTau
Highest Positive Correlation   0.882     0.891       0.839
Highest Negative Correlation  -0.495    -0.574      -0.413
Lowest Correlation             0.0       0.0         0.0
Mean Correlation               0.096     0.098       0.078
  • Most positive correlated: (GarageCars, GarageArea)
  • Most negative correlated: (BsmtFinSF1, BsmtUnfSF)
  • Least correlated: (Id, GarageYrBlt)
  • Most positive correlated: (YearBuilt, GarageYrBlt)
  • Most negative correlated: (BsmtFinSF1, BsmtUnfSF)
  • Least correlated: None
  • Most positive correlated: (YearBuilt, GarageYrBlt)
  • Most negative correlated: (BsmtFinSF1, BsmtUnfSF)
  • Least correlated: (BsmtUnfSF, KitchenAbvGr)
[12]:
plot_correlation(houses, k=30)
[12]:
DataPrep.EDA Report
                             Pearson  Spearman  KendallTau
Highest Positive Correlation   0.882     0.891       0.839
Highest Negative Correlation  -0.495    -0.574      -0.413
Lowest Correlation             0.0       0.0         0.0
Mean Correlation               0.096     0.098       0.078
  • Most positive correlated: (GarageCars, GarageArea)
  • Most negative correlated: (BsmtFinSF1, BsmtUnfSF)
  • Least correlated: (Id, GarageYrBlt)
  • Most positive correlated: (YearBuilt, GarageYrBlt)
  • Most negative correlated: (BsmtFinSF1, BsmtUnfSF)
  • Least correlated: None
  • Most positive correlated: (YearBuilt, GarageYrBlt)
  • Most negative correlated: (BsmtFinSF1, BsmtUnfSF)
  • Least correlated: (BsmtUnfSF, KitchenAbvGr)

Attribute Pair                 Correlation

(GarageArea, GarageCars)       0.882475
(GarageYrBlt, YearBuilt)       0.825667
(GrLivArea, TotRmsAbvGrd)      0.825489
(1stFlrSF, TotalBsmtSF)        0.819530
(2ndFlrSF, GrLivArea)          0.687501
(BedroomAbvGr, TotRmsAbvGrd)   0.676620
(BsmtFinSF1, BsmtFullBath)     0.649212
(GarageYrBlt, YearRemodAdd)    0.642277
(FullBath, GrLivArea)          0.630012
(2ndFlrSF, TotRmsAbvGrd)       0.616423
(2ndFlrSF, HalfBath)           0.609707
(GarageCars, OverallQual)      0.600671
(GrLivArea, OverallQual)       0.593007
(YearBuilt, YearRemodAdd)      0.592855
(GarageCars, GarageYrBlt)      0.588920
(OverallQual, YearBuilt)       0.572323
(1stFlrSF, GrLivArea)          0.566024
(GarageArea, GarageYrBlt)      0.564567
(GarageArea, OverallQual)      0.562022
(FullBath, TotRmsAbvGrd)       0.554784
(OverallQual, YearRemodAdd)    0.550684
(FullBath, OverallQual)        0.550600
(GarageYrBlt, OverallQual)     0.547766
(GarageCars, YearBuilt)        0.537850
(OverallQual, TotalBsmtSF)     0.537808
(BsmtFinSF1, TotalBsmtSF)      0.522396
(BedroomAbvGr, GrLivArea)      0.521270
(2ndFlrSF, BedroomAbvGr)       0.502901

This shows multicollinearity. In regression, “multicollinearity” refers to features that are correlated with other features. Multicollinearity occurs when your model includes multiple factors that are correlated not just to your target variable, but also to each other.
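An attribute-pair table like the one above can be reproduced with plain pandas by stacking the off-diagonal entries of the correlation matrix. A sketch on a toy frame (the values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with one collinear pair and one unrelated column.
df = pd.DataFrame({
    "GarageCars": [1, 2, 2, 3, 3],
    "GarageArea": [280, 460, 500, 720, 700],
    "MoSold":     [2, 9, 1, 6, 3],
})

# Absolute pairwise correlations, off-diagonal only, above the 0.5 cutoff.
corr = df.corr().abs()
pairs = corr.where(~np.eye(len(corr), dtype=bool)).stack()
collinear = pairs[pairs > 0.5].sort_values(ascending=False)
```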

Problem:

Multicollinearity increases the standard errors of the coefficients. That means, multicollinearity makes some variables statistically insignificant when they should be significant.

To avoid this, we can do three things:

  • Completely remove those variables.
  • Make a new feature by adding them or by some other operation.
  • Use PCA, which will reduce the feature set to a small number of non-collinear features.

Reference: http://blog.minitab.com/blog/understanding-statistics/handling-multicollinearity-in-regression-analysis
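A minimal sketch of the PCA option using only NumPy (the synthetic two-column data stands in for a collinear pair such as GarageCars/GarageArea):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two nearly collinear columns plus a little noise.
x = rng.normal(size=100)
X = np.column_stack([x, 2 * x + 0.01 * rng.normal(size=100)])

# PCA via SVD on the centered matrix: with near-perfect collinearity the
# first principal component captures almost all the variance, so it can
# replace the pair as a single non-collinear feature.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / (S**2).sum()
pc1 = Xc @ Vt[0]
```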

Univariate Analysis

How a single variable is distributed over its numeric range, what its statistical summary looks like, and whether it is positively or negatively skewed.
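Skewness itself is easy to compute with pandas. A sketch using a small illustrative series (these are the SalePrice quantiles reported below, not the full column):

```python
import pandas as pd

# Illustrative series: the SalePrice quantiles from the report below.
prices = pd.Series([34900, 88000, 129975, 163000, 214000, 326100, 755000])

# Positive sample skewness means a long right tail (positively skewed).
skewness = prices.skew()
```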

[13]:
plot(houses, "SalePrice")
[13]:
DataPrep.EDA Report

Overview

Distinct Count 663
Unique (%) 45.4%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22.8 KB
Mean 180921.1959
Minimum 34900
Maximum 755000
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%

Quantile Statistics

Minimum 34900
5-th Percentile 88000
Q1 129975
Median 163000
Q3 214000
95-th Percentile 326100
Maximum 755000
Range 720100
IQR 84025

Descriptive Statistics

Mean 180921.1959
Standard Deviation 79442.5029
Variance 6.3111e+09
Sum 2.6414e+08
Skewness 1.8809
Kurtosis 6.5098
Coefficient of Variation 0.4391
  • SalePrice is skewed right (γ1 = 1.8809)
  • SalePrice is not normally distributed (p-value 2.792946582081966e-06)
  • SalePrice has 61 outliers

Pivotal Features

[14]:
plot_correlation(houses, "OverallQual", "SalePrice")
[14]:
DataPrep.EDA Report
[15]:
plot(houses, "OverallQual", "SalePrice") # why not combine them together?
[15]:
DataPrep.EDA Report
[16]:
plot(houses, "GarageCars", "SalePrice")
[16]:
DataPrep.EDA Report
[17]:
plot(houses, "Fireplaces", "SalePrice")
[17]:
DataPrep.EDA Report
[18]:
plot(houses, "GrLivArea", "SalePrice")
[18]:
DataPrep.EDA Report
[19]:
plot(houses, "TotalBsmtSF", "SalePrice")
[19]:
DataPrep.EDA Report
[20]:
plot(houses, "YearBuilt", "SalePrice")
[20]:
DataPrep.EDA Report

In summary

Based on the above analysis, we can conclude that:

  • ‘GrLivArea’ and ‘TotalBsmtSF’ seem to be linearly related with ‘SalePrice’. Both relationships are positive, which means that as one variable increases, the other also increases. In the case of ‘TotalBsmtSF’, we can see that the slope of the linear relationship is particularly high.
  • ‘OverallQual’ and ‘YearBuilt’ also seem to be related with ‘SalePrice’. The relationship seems to be stronger in the case of ‘OverallQual’, where the box plot shows how sale prices increase with the overall quality.

We just analysed four variables, but there are many others that we should analyse. The trick here seems to be the choice of the right features (feature selection), not the definition of complex relationships between them (feature engineering).

That said, let’s separate the wheat from the chaff.

Missing Value Imputation

Missing values in the training data set can affect the prediction or classification of a model negatively.

Also, some machine learning algorithms cannot accept missing data, e.g. SVM and neural networks.

But filling missing values with the mean/median/mode, or using another predictive model to predict them, is itself a prediction that may not be 100% accurate. Alternatively, you can use models such as decision trees and random forests, which handle missing values well.
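A minimal sketch of the mean/median/mode baseline mentioned above, on a toy frame (the column values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
    "Alley": [np.nan, "Grvl", np.nan, "Pave"],
})

# Common baseline imputations: median for numeric columns (robust to the
# skew we saw earlier), most frequent value for categorical columns.
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())
df["Alley"] = df["Alley"].fillna(df["Alley"].mode()[0])
```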

Some of this part is based on this kernel: https://www.kaggle.com/bisaria/house-prices-advanced-regression-techniques/handling-missing-data

[21]:
plot_missing(houses)
[21]:
DataPrep.EDA Report

Missing Statistics

Missing Cells 6965
Missing Cells (%) 5.9%
Missing Columns 19
Missing Rows 1460
Avg Missing Cells per Column 85.99
Avg Missing Cells per Row 4.77
[22]:
# plot_missing(houses, "BsmtQual")
basement_cols = ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
                 'BsmtFinType2', 'BsmtFinSF1', 'BsmtFinSF2']
houses[basement_cols][houses['BsmtQual'].isnull()]
[22]:
BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2 BsmtFinSF1 BsmtFinSF2
17 NaN NaN NaN NaN NaN 0 0
39 NaN NaN NaN NaN NaN 0 0
90 NaN NaN NaN NaN NaN 0 0
102 NaN NaN NaN NaN NaN 0 0
156 NaN NaN NaN NaN NaN 0 0
182 NaN NaN NaN NaN NaN 0 0
259 NaN NaN NaN NaN NaN 0 0
342 NaN NaN NaN NaN NaN 0 0
362 NaN NaN NaN NaN NaN 0 0
371 NaN NaN NaN NaN NaN 0 0
392 NaN NaN NaN NaN NaN 0 0
520 NaN NaN NaN NaN NaN 0 0
532 NaN NaN NaN NaN NaN 0 0
533 NaN NaN NaN NaN NaN 0 0
553 NaN NaN NaN NaN NaN 0 0
646 NaN NaN NaN NaN NaN 0 0
705 NaN NaN NaN NaN NaN 0 0
736 NaN NaN NaN NaN NaN 0 0
749 NaN NaN NaN NaN NaN 0 0
778 NaN NaN NaN NaN NaN 0 0
868 NaN NaN NaN NaN NaN 0 0
894 NaN NaN NaN NaN NaN 0 0
897 NaN NaN NaN NaN NaN 0 0
984 NaN NaN NaN NaN NaN 0 0
1000 NaN NaN NaN NaN NaN 0 0
1011 NaN NaN NaN NaN NaN 0 0
1035 NaN NaN NaN NaN NaN 0 0
1045 NaN NaN NaN NaN NaN 0 0
1048 NaN NaN NaN NaN NaN 0 0
1049 NaN NaN NaN NaN NaN 0 0
1090 NaN NaN NaN NaN NaN 0 0
1179 NaN NaN NaN NaN NaN 0 0
1216 NaN NaN NaN NaN NaN 0 0
1218 NaN NaN NaN NaN NaN 0 0
1232 NaN NaN NaN NaN NaN 0 0
1321 NaN NaN NaN NaN NaN 0 0
1412 NaN NaN NaN NaN NaN 0 0

All categorical variables contain NaN whereas continuous ones contain 0, which means there is no basement in those houses, so we can replace the NaNs with ‘None’.

[23]:
for col in basement_cols:
    if 'FinSF' not in col:
        houses[col] = houses[col].fillna('None')
[24]:
# plot_missing(houses, "FireplaceQu")
houses["FireplaceQu"] = houses["FireplaceQu"].fillna('None')
pd.crosstab(houses.Fireplaces, houses.FireplaceQu)
[24]:
FireplaceQu Ex Fa Gd None Po TA
Fireplaces
0 0 0 0 690 0 0
1 19 28 324 0 20 259
2 4 4 54 0 0 53
3 1 1 2 0 0 1
[25]:
garage_cols = ['GarageType', 'GarageQual', 'GarageCond', 'GarageYrBlt',
               'GarageFinish', 'GarageCars', 'GarageArea']
houses[garage_cols][houses['GarageType'].isnull()]
[25]:
GarageType GarageQual GarageCond GarageYrBlt GarageFinish GarageCars GarageArea
39 NaN NaN NaN NaN NaN 0 0
48 NaN NaN NaN NaN NaN 0 0
78 NaN NaN NaN NaN NaN 0 0
88 NaN NaN NaN NaN NaN 0 0
89 NaN NaN NaN NaN NaN 0 0
... ... ... ... ... ... ... ...
1349 NaN NaN NaN NaN NaN 0 0
1407 NaN NaN NaN NaN NaN 0 0
1449 NaN NaN NaN NaN NaN 0 0
1450 NaN NaN NaN NaN NaN 0 0
1453 NaN NaN NaN NaN NaN 0 0

81 rows × 7 columns

All garage-related features are missing values in the same rows, which means we can replace the categorical variables with ‘None’ and the continuous ones with 0.

[26]:
for col in garage_cols:
    if houses[col].dtype == object:  # np.object was removed in NumPy 1.24
        houses[col] = houses[col].fillna('None')
    else:
        houses[col] = houses[col].fillna(0)